This paper shows that the logarithm of the $\varepsilon$-error capacity (average error probability) for $n$ uses of a discrete memoryless channel is upper bounded by the normal approximation plus a term that does not exceed $\frac{1}{2}\log n + O(1)$ if the $\varepsilon$-dispersion of the channel is positive. If the $\varepsilon$-dispersion vanishes, the logarithm of the $\varepsilon$-error capacity is upper bounded by $n$ times the capacity plus a constant term, unless the channel is exotic and $\varepsilon \ge \frac{1}{2}$.
I. INTRODUCTION

The primary information-theoretic task in point-to-point channel coding is the characterization of the maximum rate of communication over $n$ independent uses of a noisy channel $W$. We are concerned in this paper with discrete memoryless channels (DMCs). Let $M^*(W^n,\varepsilon)$ resp. $M^*_{\max}(W^n,\varepsilon)$ denote the maximum size of a length-$n$ block code for DMC $W$ having average resp. maximal error probability no larger than $\varepsilon \in (0,1)$. Shannon's noisy-channel coding theorem [1] and Wolfowitz's strong converse [2] state that for every $\varepsilon \in (0,1)$,

\[
\lim_{n\to\infty} \frac{1}{n} \log M^*(W^n,\varepsilon) = C \quad \text{bits/channel use},
\]

where $C := \max_P I(P,W)$ is the channel capacity. Since the mid-1960s, there has been interest in determining finer asymptotic characterizations of the coding theorem. This is useful because such an analysis provides key insights into the amount of backoff from channel capacity for block codes of finite length $n$. In particular, Strassen in 1964 [3] showed using normal approximations that the asymptotic expansion of $\log M^*_{\max}(W^n,\varepsilon)$ satisfies

\[
\log M^*_{\max}(W^n,\varepsilon) = nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \rho_n, \tag{1}
\]

where $\rho_n = O(\log n)$, $V_\varepsilon$ is the $\varepsilon$-channel dispersion [4, 5] and $\Phi(\cdot)$ is the Gaussian cumulative distribution function. These quantities will be defined precisely in Section II A. In fact, this asymptotic expansion also holds for $M^*(W^n,\varepsilon)$ [4, Eqs. (284)-(286)] and implies that if an error probability of $\varepsilon$ is tolerable, the backoff from channel capacity $C$ at finite blocklength $n$ is roughly $\sqrt{V_\varepsilon/n}\,\Phi^{-1}(\varepsilon)$. There have been several recent refinements to and extensions of Strassen's normal approximation in (1), most prominently by Hayashi [6] and Polyanskiy-Poor-Verdu (PPV) [4]. Strassen's normal approximation has also been shown to hold for many other classes of channels such as the additive white Gaussian noise channel [4-6].

Despite these impressive advances in the fundamental limits of channel coding, the third-order term $\rho_n$ is not well understood. Indeed, Hayashi in the conclusion of his paper [6] mentions that

"... the third-order coding rate is expected but appears difficult. The second order is the order $\sqrt{n}$, and it is not clear whether the third order is a constant order or the order $\log n$ ..."
What we do know is that for the binary symmetric channel (BSC), $\rho_n = \frac{1}{2}\log n + O(1)$ [4, Thm. 52] and for the binary erasure channel (BEC), $\rho_n = O(1)$ [4, Thm. 53]. More generally, there are classes of channels for which we have bounds on $\rho_n$ [5, Sec. 3.4.5]. For lower bounds (achievability), if we consider DMCs $W$ with positive capacity and all elements of the stochastic matrix $W$ are positive, $\rho_n \ge \frac{1}{2}\log n + O(1)$ [5, Cor. 54]. For upper bounds (converse), if we restrict our attention to so-called weakly input-symmetric DMCs [5, Def. 9], $\rho_n \le \frac{1}{2}\log n + O(1)$ [5, Thm. 55]. For constant-composition codes [7], it was shown [8] using strong large-deviation techniques [9, Sec. XVI.7] that, under some regularity assumptions, $\rho_n = \frac{1}{2}\log n + O(1)$. Recall that a constant-composition code is one where all the codewords are of the same empirical distribution or type. It is also claimed that the same holds for a more general class of DMCs in [10]. Our results generalize the converses in [8, 10].

This paper strengthens the upper bound (converse) on the third-order term $\rho_n$. For all DMCs whose $\varepsilon$-dispersions are positive, we show that

\[
\log M^*(W^n,\varepsilon) \le nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \frac{1}{2}\log n + O(1). \tag{2}
\]

If the $\varepsilon$-dispersion vanishes, the corresponding bound is $\log M^*(W^n,\varepsilon) \le nC + O(1)$, unless the DMC is exotic [4, Thm. 48] and $\varepsilon \ge \frac{1}{2}$. If the DMC is exotic and $\varepsilon = \frac{1}{2}$, we show that $\log M^*(W^n,\frac{1}{2}) \le nC + \frac{1}{2}\log n + O(1)$. If the DMC is exotic and $\varepsilon > \frac{1}{2}$, $\log M^*(W^n,\varepsilon) \le nC + O(n^{1/3})$, a result by PPV [4, Thm. 48]. Hence, for the rather general class of DMCs with positive $\varepsilon$-dispersion, the third-order term is $\rho_n \le \frac{1}{2}\log n + O(1)$. We may thus dispense with the assumption that $W$ is weakly input-symmetric [5, Def. 9].

The typical way [3-7] to upper bound $M^*(W^n,\varepsilon)$ is to first do the same for the maximum size of a constant-composition code under the maximum error probability formulation, i.e., $M^*_{\max}(W^n,\varepsilon)$. Such a bound can be proved using either the meta-converse [4, Thm. 31] or tight bounds on the type-II error probability in a simple binary hypothesis test [3, Thm. 1.1]. By the type-counting lemma [7, Lem. 2.2], every length-$n$ block code can be partitioned into no more than $(n+1)^{|\mathcal{X}|-1}$ constant-composition subcodes. This leads to the rather conservative bound [3, Eq. (4.29)] [4, Eq. (279)]

\[
\log M^*_{\max}(W^n,\varepsilon) \le nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \left(|\mathcal{X}| - \frac{1}{2}\right)\log n + O(1). \tag{3}
\]

Subsequently, by expurgating bad codewords (see [4, Eqs. (284)-(286)]), we can conclude that the same upper bound holds for $M^*(W^n,\varepsilon)$. We adopt a different approach for the proof of our main result in (2) and work with $M^*(W^n,\varepsilon)$ directly. In a nutshell, we generalize the converse technique in Wang-Colbeck-Renner [11] and Wang-Renner [12], exploit the link [13, Lem. 12] between the $\varepsilon$-hypothesis testing relative entropy [14] and the relative entropy information spectrum [15, Ch. 4] and carefully weigh the contributions of each input type for a general (non-constant-composition) code by constructing an appropriate $\epsilon$-net for the output probability simplex. The last step, which replaces the use of the type-counting lemma, allows us to bound the effect of different input types with the $O(1)$ term in (2).

Note that unlike in (3), the third-order term in our upper bound in (2) is independent of $|\mathcal{X}|$. This is intuitive upon doing the following thought experiment. Let $n$ be a large even integer and consider transmitting information across $n$ uses of a DMC $W : \mathcal{X} \to \mathcal{Y}$. Clearly, the same amount of information can be transmitted through $\frac{n}{2}$ uses of the product channel $W^2 : \mathcal{X}^{\times 2} \to \mathcal{Y}^{\times 2}$, where $W^2(y,y'|x,x') := W(y|x)W(y'|x')$. The capacity and the dispersion of $W^2$ are respectively twice the capacity and the dispersion of $W$, so the normal approximation terms for $n$ uses of $W$ and $\frac{n}{2}$ uses of $W^2$ are identical. If the coefficient of the third-order logarithmic term were dependent on the size of the input alphabet, say via some function $g(|\mathcal{X}|)$, then for the first experiment, $\rho_n = g(|\mathcal{X}|)\log n + O(1)$ while for the second experiment, $\rho_n = g(|\mathcal{X}|^2)\log(\frac{n}{2}) + O(1) = g(|\mathcal{X}|^2)\log n + O(1)$. Thus, at least at an intuitive level, we expect that $g(|\mathcal{X}|)$ is independent of $|\mathcal{X}|$.
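The doubling of capacity and dispersion under the product construction can be checked numerically. The following is a minimal sketch, assuming a BSC with crossover probability 0.11 and the uniform input distribution (which is capacity-achieving for the BSC); the helper names `info_and_variance` and `product_channel` are illustrative and not part of the paper. All logarithms are natural.

```python
import math
from itertools import product

def info_and_variance(P, W):
    """Return (I(P,W), V(P,W)) in nats for input distribution P and channel W[x][y]."""
    ny = len(W[0])
    PW = [sum(P[x] * W[x][y] for x in range(len(P))) for y in range(ny)]
    I, V = 0.0, 0.0
    for x, px in enumerate(P):
        if px == 0:
            continue
        # Atoms (probability, log-likelihood ratio) of log W(y|x)/PW(y) under W(.|x).
        atoms = [(W[x][y], math.log(W[x][y] / PW[y])) for y in range(ny) if W[x][y] > 0]
        d = sum(w * l for w, l in atoms)             # D(W(.|x) || PW)
        v = sum(w * (l - d) ** 2 for w, l in atoms)  # V(W(.|x) || PW)
        I += px * d
        V += px * v
    return I, V

def product_channel(W):
    """Stochastic matrix of W^2(y,y'|x,x') = W(y|x) W(y'|x')."""
    nx, ny = len(W), len(W[0])
    return [[W[x1][y1] * W[x2][y2] for y1, y2 in product(range(ny), repeat=2)]
            for x1, x2 in product(range(nx), repeat=2)]

p = 0.11                        # assumed crossover probability
W = [[1 - p, p], [p, 1 - p]]    # BSC(p)
P = [0.5, 0.5]                  # uniform input, capacity-achieving for the BSC
W2 = product_channel(W)
P2 = [a * b for a, b in product(P, repeat=2)]

I1, V1 = info_and_variance(P, W)
I2, V2 = info_and_variance(P2, W2)
print(I1, V1, I2, V2)
```

With these inputs, `I2` and `V2` come out as twice `I1` and `V1` up to rounding, so $nC + \sqrt{nV}\,\Phi^{-1}(\varepsilon)$ for $n$ uses of $W$ agrees with the same expression for $\frac{n}{2}$ uses of $W^2$.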



II. NOTATION AND PRELIMINARIES 
A. Discrete Memoryless Channels 

As mentioned in the Introduction, we consider discrete memoryless channels (DMCs), which are characterized by two finite sets, the input alphabet $\mathcal{X}$ and the output alphabet $\mathcal{Y}$, and a stochastic matrix $W$, where $W(y|x)$ denotes the probability that the output $y \in \mathcal{Y}$ occurs given input $x \in \mathcal{X}$. The set of probability distributions on $\mathcal{X}$ is denoted $\mathcal{P}(\mathcal{X})$. For any probability distribution $P \in \mathcal{P}(\mathcal{X})$, we denote by $P \times W : (x,y) \mapsto P(x)W(y|x)$ the joint distribution of inputs and outputs of the channel, and by $PW : y \mapsto \sum_x P(x)W(y|x)$ its marginal on $\mathcal{Y}$. Finally, $W(\cdot|x)$ denotes the distribution on $\mathcal{Y}$ if the input is fixed to $x$.

Given two probability distributions $P, Q \in \mathcal{P}(\mathcal{X})$, we call the random variable $\log\frac{P(X)}{Q(X)}$, where $X$ has distribution $P$, the log-likelihood ratio of $P$ and $Q$. Its mean is the relative entropy

\[
D(P\|Q) := \mathbb{E}_P\left[\log\frac{P}{Q}\right] = \sum_x P(x)\log\frac{P(x)}{Q(x)}
\]

and $D(W\|Q|P) := \sum_x P(x)\,D(W(\cdot|x)\|Q)$ is the conditional information divergence. The mutual information is $I(P,W) := D(W\|PW|P)$. Moreover,

\[
C(W) := \max_{P \in \mathcal{P}(\mathcal{X})} I(P,W) \quad\text{and}\quad \Pi(W) := \{P \in \mathcal{P}(\mathcal{X}) \,|\, I(P,W) = C(W)\}
\]

are the capacity and the set of capacity-achieving input distributions (CAIDs), respectively.$^1$ The set of CAIDs is convex and compact in $\mathcal{P}(\mathcal{X})$. The unique [16, Cor. 2 to Thm. 4.5.1] capacity-achieving output distribution (CAOD) is denoted as $Q^*$ and $Q^* = PW$ for all $P \in \Pi$. Furthermore, it satisfies $Q^*(y) > 0$ for all $y \in \mathcal{Y}$ [16, Cor. 1 to Thm. 4.5.1], where we assume that all outputs are accessible.

The variance of the log-likelihood ratio of $P$ and $Q$ is the divergence variance

\[
V(P\|Q) := \mathbb{E}_P\left[\left(\log\frac{P}{Q} - D(P\|Q)\right)^2\right].
\]

We also define the conditional divergence variance $V(W\|Q|P) := \sum_x P(x)\,V(W(\cdot|x)\|Q)$ and the conditional information variance $V(P,W) := V(W\|PW|P)$. Note that $V(P,W) = V(P \times W\|P \times PW)$ for all $P \in \Pi$ [4, Lem. 62]. The $\varepsilon$-channel dispersion$^2$ [4, Def. 2] is an operational quantity that was shown [4, Eq. (223)] to be equal to

\[
V_\varepsilon(W) := \begin{cases} V_{\min} & \text{if } \varepsilon < \frac{1}{2} \\ V_{\max} & \text{if } \varepsilon \ge \frac{1}{2} \end{cases}, \quad\text{where } V_{\min} := \min_{P \in \Pi} V(P,W) \text{ and } V_{\max} := \max_{P \in \Pi} V(P,W).
\]

Furthermore, a channel is called exotic [4, before Thm. 48] if $V_{\max} = 0$ and there exists a symbol $x_0 \in \mathcal{X}$ such that $D(W(\cdot|x_0)\|Q^*) = C$ and $V(W(\cdot|x_0)\|Q^*) > 0$.$^3$



1 We often drop the dependence on $W$ if it is clear from context.

2 Notice that for $\varepsilon = \frac{1}{2}$, we set $V_\varepsilon = V_{\max}$. This is somewhat unconventional; cf. [4, Thm. 48]. However, doing so ensures that Theorem 1 can be stated compactly. Nonetheless, from the viewpoint of the normal approximation, it is immaterial how we choose $V_{1/2}$ since $\Phi^{-1}(\frac{1}{2}) = 0$ (cf. [4, after Eq. (280)]).

3 Note that this symbol must satisfy $P(x_0) = 0$ for any $P \in \Pi$, as otherwise $V_{\max}$ would not vanish.






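For later reference, we also define the third absolute moment of the log-likelihood ratio,

\[
T(P\|Q) := \mathbb{E}_P\left[\left|\log\frac{P}{Q} - D(P\|Q)\right|^3\right],
\]

and $T(W\|Q|P) := \sum_x P(x)\,T(W(\cdot|x)\|Q)$.

We employ the cumulative distribution function of the standard normal distribution

\[
\Phi(a) := \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right) dx
\]

and define its inverse as $\Phi^{-1}(\varepsilon) := \sup\{a \in \mathbb{R} \,|\, \Phi(a) \le \varepsilon\}$, which evaluates to the usual inverse for $0 < \varepsilon < 1$ and is continuously extended to take values $\pm\infty$ outside that range.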

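For a sequence $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in \mathcal{X}^{\times n}$, we denote by $P_{\mathbf{x}} \in \mathcal{P}(\mathcal{X})$ the probability distribution given by the relative frequencies of $\mathbf{x}$, i.e. $P_{\mathbf{x}}(x) = \frac{1}{n}\sum_{i=1}^{n} 1\{x_i = x\}$. This probability distribution $P_{\mathbf{x}}$ is also known as the empirical distribution or the type [7] of $\mathbf{x}$. The set of all such distributions is denoted as $\mathcal{P}_n(\mathcal{X}) = \bigcup_{\mathbf{x}} \{P_{\mathbf{x}}\}$ and satisfies $|\mathcal{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|-1}$.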
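As a small illustration of types and of the counting bound, the sketch below computes the empirical distribution of a sequence and compares the exact number of types, which is the number of ways to split $n$ into $|\mathcal{X}|$ non-negative counts, with $(n+1)^{|\mathcal{X}|-1}$. The function names and the example sequence are illustrative only.

```python
from collections import Counter
from math import comb

def type_of(seq, alphabet):
    """Empirical distribution (type) of seq over the given alphabet."""
    n = len(seq)
    counts = Counter(seq)
    return tuple(counts[a] / n for a in alphabet)

def number_of_types(n, k):
    """Exact number of types of length-n sequences over an alphabet of size k."""
    return comb(n + k - 1, k - 1)

alphabet = ("a", "b", "c")
x = ("a", "b", "a", "c", "a", "b")
P_x = type_of(x, alphabet)
print(P_x)

# The type-counting bound |P_n(X)| <= (n+1)^(|X|-1) for a few blocklengths.
for n in (1, 10, 100):
    print(n, number_of_types(n, len(alphabet)), (n + 1) ** (len(alphabet) - 1))
```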



B. Codes and $\varepsilon$-Error Capacity

A code $\mathcal{C}$ for a channel is defined by the triple $\{\mathcal{M}, e, d\}$, where $\mathcal{M}$ is a set of messages, $e : \mathcal{M} \to \mathcal{X}$ an encoding function and $d : \mathcal{Y} \to \mathcal{M}$ a decoding function. We write $|\mathcal{C}| = |\mathcal{M}|$ for the cardinality of the message set. We define the average error probability of a code $\mathcal{C}$ for the channel $W$ as

\[
p_{\mathrm{err}}(\mathcal{C}, W) := \Pr[M \ne M'] = 1 - \frac{1}{|\mathcal{M}|}\sum_{m \in \mathcal{M}} W\big(d^{-1}(m)\,\big|\,e(m)\big),
\]

where the distribution over messages $P_M$ is assumed to be uniform on $\mathcal{M}$, $M \to X \to Y \to M'$ forms a Markov chain, and $M'$ thus denotes the output of the decoder. The one-shot $\varepsilon$-error capacity of the channel $W$ is then defined as

\[
M^*(W,\varepsilon) := \max\{m \in \mathbb{N} \,|\, \exists\, \mathcal{C} : |\mathcal{C}| = m \wedge p_{\mathrm{err}}(\mathcal{C},W) \le \varepsilon\}.
\]

We are also interested in the $\varepsilon$-error capacity for $n \ge 1$ uses of a memoryless channel. For this purpose, we consider the channel $W^n$, defined by the stochastic matrix $W^n(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{n} W(y_i|x_i)$, where $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ are strings of length $n$ of symbols $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$, respectively. Then, the blocklength $n$, $\varepsilon$-error capacity of the channel $W$ is denoted as $M^*(W^n,\varepsilon)$.

III. MAIN RESULT AND PROOF 

Let us reiterate our main result. The various cases are illustrated diagrammatically in Fig. 1. 
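Theorem 1. For every DMC $W$ and $\varepsilon$ with $V_\varepsilon > 0$, the blocklength $n$, $\varepsilon$-error capacity satisfies

\[
\log M^*(W^n,\varepsilon) \le nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \frac{1}{2}\log n + O(1).
\]

If $V_\varepsilon = 0$, we have $\log M^*(W^n,\varepsilon) \le nC + O(1)$, unless the channel is exotic and $\varepsilon \ge \frac{1}{2}$.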
FIG. 1. Illustration of the various cases of Theorem 1 and the proof structure in Section III E. The cases shown are: $\le nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \frac{1}{2}\log n + O(1)$ [Props. 8 and 10(i)]; $\le nC + O(1)$ [Prop. 9]; $\le nC + \frac{1}{2}\log n + O(1)$ [Prop. 10(ii)]; $\le nC + O(n^{1/3})$ [4, Thm. 48].



Remark 1. For exotic channels and $\varepsilon > \frac{1}{2}$, PPV showed the converse bound $\log M^*_{\max}(W^n,\varepsilon) \le nC + O(n^{1/3})$ [4, Thm. 48]. It was also shown, via an example, that the $O(n^{1/3})$ term cannot be improved in general [4, App. H].

Remark 2. The $\varepsilon = \frac{1}{2}$ case needs to be treated with care. For exotic channels and $\varepsilon = \frac{1}{2}$ (so $V_\varepsilon = 0$), we show that $\log M^*(W^n,\varepsilon) \le nC + \frac{1}{2}\log n + O(1)$. In fact, for all DMCs $W$ with $V_{\min} = 0$ and $\varepsilon = \frac{1}{2}$, we show that $\log M^*(W^n,\varepsilon) \le nC + \frac{1}{2}\log n + O(1)$. See Proposition 10. If $V_{\max} > 0$, the latter statement concurs with the positive $\varepsilon$-dispersion case of Theorem 1.

Remark 3. From the preceding statements, we see that for DMCs with $V_{\min} = 0$ and $V_{\max} > 0$, the third-order term "jumps" from $0$ to $\frac{1}{2}\log n$ when $\varepsilon \uparrow \frac{1}{2}$. This is because we do not investigate the dependence of the constant term on $\varepsilon$. If we did, for the case $V_{\min} = 0$, $V_{\max} > 0$ and $\varepsilon < \frac{1}{2}$, we would notice that the constant term diverges as $\varepsilon \uparrow \frac{1}{2}$.

In light of the existing results on $\rho_n$ (in the Introduction and [5, Sec. 3.4.5]), the third-order term is the best possible unless we impose further assumptions on $W$.

The proof consists of five parts, each detailed in one of the following subsections. In the first 
subsection, we introduce two entropic quantities, the hypothesis testing divergence [11-14] and 
a quantity related to the information (or divergence) spectrum [15, Ch. 4]. We state and prove 
some useful properties we need later. In the second subsection, we derive a converse bound, valid 
for general DMCs, that involves a minimization over output distributions and maximization over 
input symbols. In the third subsection, we choose an appropriate output distribution for use 
in the general converse bound. In the fourth subsection, we state and prove some continuity 
properties of information measures around the CAIDs and the unique CAOD. Finally, the fifth 
subsection contains the proof of our main result. 



A. Hypothesis Testing and the Information Spectrum 

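We use the following divergence [11-14], which is closely related to binary hypothesis testing. Let $\varepsilon \in (0,1)$ and let $P, Q \in \mathcal{P}(\mathcal{Z})$, where $\mathcal{Z}$ is finite. We consider binary (probabilistic) hypothesis tests $\xi : \mathcal{Z} \to [0,1]$ and define the $\varepsilon$-hypothesis testing divergence

\[
D_h^\varepsilon(P\|Q) := \sup\left\{R \in \mathbb{R} \,\middle|\, \exists\, \xi : \mathbb{E}_Q[\xi(Z)] \le (1-\varepsilon)\exp(-R) \wedge \mathbb{E}_P[\xi(Z)] \ge 1-\varepsilon\right\}.
\]

Note that $D_h^\varepsilon(P\|Q) = -\log\frac{\beta_{1-\varepsilon}(P,Q)}{1-\varepsilon}$, where $\beta_\alpha$ is defined in PPV [4, Eq. (100)]. It is easy to see that $D_h^\varepsilon(P\|Q) \ge 0$, where the lower bound is achieved if and only if $P = Q$, and $D_h^\varepsilon(P\|Q)$ diverges if $P$ and $Q$ are orthogonal. It satisfies a data-processing inequality [11]

\[
D_h^\varepsilon(P\|Q) \ge D_h^\varepsilon(PW\|QW) \quad\text{for all channels } W \text{ from } \mathcal{Z} \text{ to } \mathcal{Z}'.
\]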
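For finite alphabets, $\beta_{1-\varepsilon}(P,Q)$ and hence $D_h^\varepsilon(P\|Q)$ can be evaluated exactly with a randomized likelihood-ratio (Neyman-Pearson) test. The sketch below is one way to do this; the function names `beta` and `D_h` and the example distributions are illustrative assumptions, and natural logarithms are used throughout.

```python
import math

def beta(alpha, P, Q):
    """Smallest E_Q[xi] over tests xi with E_P[xi] >= alpha (Neyman-Pearson)."""
    support = [z for z in range(len(P)) if P[z] > 0]
    support.sort(key=lambda z: Q[z] / P[z])   # largest likelihood ratio P/Q first
    need, b = alpha, 0.0
    for z in support:
        if need <= 0:
            break
        frac = min(1.0, need / P[z])          # randomize on the boundary symbol
        b += frac * Q[z]
        need -= frac * P[z]
    return b

def D_h(eps, P, Q):
    """eps-hypothesis testing divergence -log(beta_{1-eps}(P,Q) / (1-eps)) in nats."""
    b = beta(1 - eps, P, Q)
    return math.inf if b == 0 else -math.log(b / (1 - eps))

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
# A deterministic channel that merges the first two symbols (data processing).
PW = [P[0] + P[1], P[2]]
QW = [Q[0] + Q[1], Q[2]]
print(D_h(0.1, P, Q), D_h(0.1, PW, QW))
```

The three properties stated above can be observed directly: the value is zero for $P = Q$, infinite for orthogonal $P$ and $Q$, and it does not increase under the merging channel.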
When evaluated for independent and identical distributions (i.i.d.), its asymptotic expansion in the first order is determined by the Chernoff-Stein Lemma [7, Cor. 1.2], yielding $D_h^\varepsilon(P^{\times n}\|Q^{\times n}) = nD(P\|Q) + o(n)$ for any $\varepsilon \in (0,1)$. This analysis was tightened by Strassen [3, Thm. 3.1] and he showed that

\[
D_h^\varepsilon(P^{\times n}\|Q^{\times n}) = nD(P\|Q) + \sqrt{nV(P\|Q)}\,\Phi^{-1}(\varepsilon) + \frac{1}{2}\log n + O(1).
\]

The following quantity, which characterizes the distribution of the log-likelihood ratio and is known as the relative entropy information spectrum or the divergence spectrum [15, Ch. 4], is sometimes easier to manipulate and evaluate.

\[
D_s^\varepsilon(P\|Q) := \sup\left\{R \in \mathbb{R} \,\middle|\, \Pr_P\left[\log\frac{P}{Q} \le R\right] \le \varepsilon\right\}.
\]

It is intimately related to the $\varepsilon$-hypothesis testing divergence, as the following lemma shows.

Lemma 2. For any $\delta \in (0, 1-\varepsilon)$, we have

\[
D_s^\varepsilon(P\|Q) - \log\frac{1}{1-\varepsilon} \le D_h^\varepsilon(P\|Q) \le D_s^{\varepsilon+\delta}(P\|Q) + \log\frac{1-\varepsilon}{\delta}. \tag{4}
\]

These relations follow from standard arguments relating binary hypothesis testing and the log-likelihood test to the relative entropy information spectrum. In [13, Lem. 12], an analogue of the above lemma is shown for the strictly more general non-commutative case. For completeness we show the second inequality, which we will employ later.

Proof of Second Inequality in (4). If $D_h^\varepsilon(P\|Q)$ is infinite, $P$ is not absolutely continuous with respect to $Q$ and it is easy to see that $D_s^{\varepsilon+\delta}(P\|Q)$ is also infinite. Hence, the second inequality in (4) trivially holds. We thus consider the case where $D_h^\varepsilon(P\|Q)$ is finite, and fix any optimal test $\xi$ for $D_h^\varepsilon(P\|Q)$. Set $R^* := D_h^\varepsilon(P\|Q) + \log\frac{\delta}{1-\varepsilon}$. We find

\[
\begin{aligned}
\Pr_P\left[\log\frac{P}{Q} > R^*\right] &= \sum_{z \in \mathcal{Z}} P(z)\, 1\{P(z) > \exp(R^*)Q(z)\} \\
&\ge \sum_{z \in \mathcal{Z}} \big(P(z) - \exp(R^*)Q(z)\big)\, 1\{P(z) > \exp(R^*)Q(z)\} \\
&\ge \sum_{z \in \mathcal{Z}} \big(P(z) - \exp(R^*)Q(z)\big)\, \xi(z) \\
&= \mathbb{E}_P[\xi(Z)] - \exp(R^*)\,\mathbb{E}_Q[\xi(Z)] \\
&\ge 1 - \varepsilon - \delta.
\end{aligned}
\]

In the last step we used the fact that $\xi$ is an optimal test, which implies that $\mathbb{E}_P[\xi(Z)] \ge 1-\varepsilon$ and $\mathbb{E}_Q[\xi(Z)] \le (1-\varepsilon)\exp\big(-D_h^\varepsilon(P\|Q)\big)$. Thus, $D_s^{\varepsilon+\delta}(P\|Q) \ge R^*$, concluding the proof. □
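Both inequalities in (4) can be sanity-checked on small examples. The sketch below evaluates $D_s^\varepsilon$ as the smallest atom of the log-likelihood ratio at which the cumulative $P$-probability exceeds $\varepsilon$, and $D_h^\varepsilon$ through a randomized Neyman-Pearson test; the function names, the example distributions and the numerical tolerance are assumptions made for illustration.

```python
import math

def beta(alpha, P, Q):
    """Smallest E_Q[xi] over tests xi with E_P[xi] >= alpha (Neyman-Pearson)."""
    support = sorted((z for z in range(len(P)) if P[z] > 0), key=lambda z: Q[z] / P[z])
    need, b = alpha, 0.0
    for z in support:
        if need <= 0:
            break
        frac = min(1.0, need / P[z])
        b += frac * Q[z]
        need -= frac * P[z]
    return b

def D_h(eps, P, Q):
    b = beta(1 - eps, P, Q)
    return math.inf if b == 0 else -math.log(b / (1 - eps))

def D_s(eps, P, Q):
    """sup{R : Pr_P[log P/Q <= R] <= eps} for finite alphabets (nats)."""
    atoms = sorted((math.log(P[z] / Q[z]) if Q[z] > 0 else math.inf, P[z])
                   for z in range(len(P)) if P[z] > 0)
    cum = 0.0
    for llr, mass in atoms:
        cum += mass
        if cum > eps:      # the CDF first exceeds eps at this atom
            return llr
    return math.inf

def lemma2_holds(eps, delta, P, Q, tol=1e-9):
    lower = D_s(eps, P, Q) - math.log(1 / (1 - eps))
    upper = D_s(eps + delta, P, Q) + math.log((1 - eps) / delta)
    return lower <= D_h(eps, P, Q) + tol and D_h(eps, P, Q) <= upper + tol

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
checks = [lemma2_holds(eps, delta, P, Q)
          for eps in (0.05, 0.2, 0.5) for delta in (0.01, 0.1, 0.3)]
print(all(checks))
```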

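We can give an upper bound on $D_s^\varepsilon(P\|Q)$ if $Q$ is a convex combination of distributions.

Lemma 3. Let $P \in \mathcal{P}(\mathcal{Z})$ and $Q = \sum_{i \in \mathcal{I}} q(i)Q^i$ with $Q^i \in \mathcal{P}(\mathcal{Z})$ and $q \in \mathcal{P}(\mathcal{I})$, where $\mathcal{I}$ is some countable index set. Then,

\[
D_s^\varepsilon(P\|Q) \le \inf_{i \in \mathcal{I}} \left\{ D_s^\varepsilon(P\|Q^i) - \log q(i) \right\}.
\]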
Proof. Note that for all $z \in \mathcal{Z}$, for all $i \in \mathcal{I}$, we have

\[
\log\frac{P(z)}{Q(z)} = \log\frac{P(z)}{\sum_j q(j)Q^j(z)} \le \log\frac{P(z)}{q(i)Q^i(z)} = \log\frac{P(z)}{Q^i(z)} - \log q(i).
\]

Hence,

\[
\Pr_P\left[\log\frac{P}{Q} \le R\right] \ge \Pr_P\left[\log\frac{P}{Q^i} \le R + \log q(i)\right]
\]

and, relaxing the optimization in the definition of $D_s^\varepsilon$, we get $D_s^\varepsilon(P\|Q) \le D_s^\varepsilon(P\|Q^i) - \log q(i)$ as desired. □

The following property will be particularly useful, as it allows us to bound the log-likelihood 
ratio of the input-output behavior of two channels in terms of the log-likelihood ratio evaluated 
for a single input symbol. 

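Lemma 4. Let $P \in \mathcal{P}(\mathcal{X})$ and let $V$, $W$ be channels from $\mathcal{X}$ to $\mathcal{Y}$. Then,

\[
D_s^\varepsilon(P \times W\|P \times V) \le \sup_{x : P(x) > 0} D_s^\varepsilon\big(W(\cdot|x)\|V(\cdot|x)\big).
\]

Proof. We first note that the log-likelihood ratio takes on the form

\[
\log\frac{P \times W}{P \times V}(x,y) = \log\frac{P(x)W(y|x)}{P(x)V(y|x)} = \log\frac{W(y|x)}{V(y|x)}
\]

for every $(x,y) \in \mathcal{X} \times \mathcal{Y}$ satisfying $P(x) > 0$. Now, we may write

\[
\begin{aligned}
R^* = D_s^\varepsilon(P \times W\|P \times V) &= \sup\left\{R \in \mathbb{R} \,\middle|\, \Pr_{P \times W}\left[\log\frac{P \times W}{P \times V} \le R\right] \le \varepsilon\right\} \\
&= \sup\left\{R \in \mathbb{R} \,\middle|\, \sum_{x} P(x) \Pr_{W(\cdot|x)}\left[\log\frac{W(\cdot|x)}{V(\cdot|x)} \le R\right] \le \varepsilon\right\}.
\end{aligned}
\]

Inspecting this expression, for any $\psi > 0$, we find at least one $x^* \in \mathcal{X}$ such that

\[
P(x^*) > 0 \quad\text{and}\quad \Pr_{W(\cdot|x^*)}\left[\log\frac{W(\cdot|x^*)}{V(\cdot|x^*)} \le R^* - \psi\right] \le \varepsilon.
\]

Hence, $D_s^\varepsilon\big(W(\cdot|x^*)\|V(\cdot|x^*)\big) \ge R^* - \psi$, which implies the lemma as $\psi$ is arbitrary. □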



The distribution of the log-likelihood ratio has the following asymptotic expansions for not 
necessarily identical product distributions. 

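Lemma 5. Let $P_i, Q \in \mathcal{P}(\mathcal{Z})$ be such that $P_i \ll Q$ for all $i$ in some finite set $\mathcal{I}$. We consider a sequence of distributions $P_{i_k}$ indexed by $(i_1, i_2, \ldots, i_n)$ where $i_k \in \mathcal{I}$ for each $1 \le k \le n$. Define

\[
D_n := \frac{1}{n}\sum_{k=1}^{n} D(P_{i_k}\|Q), \quad V_n := \frac{1}{n}\sum_{k=1}^{n} V(P_{i_k}\|Q) \quad\text{and}\quad T_n := \frac{1}{n}\sum_{k=1}^{n} T(P_{i_k}\|Q).
\]

If $V_n > 0$, then we have the Berry-Esseen-type bound

\[
D_s^\varepsilon(P_{i_1} \times \cdots \times P_{i_n}\|Q^{\times n}) \le nD_n + \sqrt{nV_n}\,\Phi^{-1}\left(\varepsilon + \frac{6T_n}{\sqrt{nV_n^3}}\right).
\]

In any case, we have the Chebyshev-type bound

\[
D_s^\varepsilon(P_{i_1} \times \cdots \times P_{i_n}\|Q^{\times n}) \le nD_n + \sqrt{\frac{nV_n}{1-\varepsilon}}. \tag{5}
\]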
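Proof. We consider the cumulative distribution of the random variable $S_n := \sum_{k} \big(\log P_{i_k}(X_{i_k}) - \log Q(X_{i_k})\big)$ where each $X_{i_k}$ has distribution $P_{i_k}$. The random variable $S_n$ has mean $nD_n$ and variance $nV_n$. The general case, Eq. (5), is shown using Chebyshev's inequality, which yields, for every $R$ that satisfies the constraint in the definition of $D_s^\varepsilon$,

\[
\varepsilon \ge \Pr\left[\sum_{k} \log\frac{P_{i_k}}{Q} \le R\right] \ge 1 - \frac{nV_n}{(R - nD_n)^2} \quad\text{for } R > nD_n.
\]

Hence, restricting to $R > nD_n$ and relaxing the condition of the supremum, we find

\[
D_s^\varepsilon(P_{i_1} \times \cdots \times P_{i_n}\|Q^{\times n}) \le \sup\left\{R > nD_n \,\middle|\, 1 - \frac{nV_n}{(R - nD_n)^2} \le \varepsilon\right\} = nD_n + \sqrt{\frac{nV_n}{1-\varepsilon}}.
\]

Furthermore, if $V_n > 0$, the Berry-Esseen theorem [9, Sec. XVI.5] states that

\[
\left|\Pr\left[\sum_{k} \log\frac{P_{i_k}}{Q} \le R\right] - \Phi\left(\frac{R - nD_n}{\sqrt{nV_n}}\right)\right| \le \frac{6T_n}{\sqrt{nV_n^3}}.
\]

Hence, we obtain

\[
D_s^\varepsilon(P_{i_1} \times \cdots \times P_{i_n}\|Q^{\times n}) \le nD_n + \sqrt{nV_n}\,\Phi^{-1}\left(\varepsilon + \frac{6T_n}{\sqrt{nV_n^3}}\right),
\]

which concludes the proof. □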
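The two bounds of Lemma 5 can be compared against the exact value of $D_s^\varepsilon$ in the i.i.d. special case (all $i_k$ equal), where the log-likelihood ratio sum takes only $n+1$ values for binary distributions. The sketch below does this for an assumed pair $P = (0.3, 0.7)$, $Q = (0.5, 0.5)$ with $n = 400$ and $\varepsilon = 0.1$; function names are illustrative and natural logarithms are used.

```python
import math
from statistics import NormalDist

def moments(P, Q):
    """Return (D, V, T) of the log-likelihood ratio of P and Q (full support assumed)."""
    llr = [math.log(p / q) for p, q in zip(P, Q)]
    D = sum(p * l for p, l in zip(P, llr))
    V = sum(p * (l - D) ** 2 for p, l in zip(P, llr))
    T = sum(p * abs(l - D) ** 3 for p, l in zip(P, llr))
    return D, V, T

def D_s_iid_binary(eps, P, Q, n):
    """Exact D_s^eps(P^n || Q^n) for binary P, Q with full support."""
    l0, l1 = math.log(P[0] / Q[0]), math.log(P[1] / Q[1])
    atoms = sorted((k * l0 + (n - k) * l1,
                    math.comb(n, k) * P[0] ** k * P[1] ** (n - k))
                   for k in range(n + 1))
    cum = 0.0
    for value, mass in atoms:
        cum += mass
        if cum > eps:
            return value
    return math.inf

def phi_inv(u):
    """Inverse standard normal CDF, extended to +/- infinity outside (0, 1)."""
    if u <= 0:
        return -math.inf
    if u >= 1:
        return math.inf
    return NormalDist().inv_cdf(u)

P, Q, n, eps = [0.3, 0.7], [0.5, 0.5], 400, 0.1
D, V, T = moments(P, Q)
exact = D_s_iid_binary(eps, P, Q, n)
chebyshev = n * D + math.sqrt(n * V / (1 - eps))
berry_esseen = n * D + math.sqrt(n * V) * phi_inv(eps + 6 * T / math.sqrt(n * V ** 3))
print(exact, berry_esseen, chebyshev)
```

In this example the Berry-Esseen-type bound is the tighter of the two, but its argument $\varepsilon + 6T_n/\sqrt{nV_n^3}$ can exceed 1 for small $n$, in which case only the Chebyshev-type bound is informative.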



B. Converse Bounds on General Channels 



Here, we give a new converse bound on the code size for general channels. 

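Proposition 6. Let $\varepsilon \in (0,1)$ and let $W$ be any channel from $\mathcal{X}$ to $\mathcal{Y}$. Then, for any $\delta \in (0, 1-\varepsilon)$, we have

\[
\log M^*(W,\varepsilon) \le \inf_{Q \in \mathcal{P}(\mathcal{Y})} \sup_{x \in \mathcal{X}} D_s^{\varepsilon+\delta}\big(W(\cdot|x)\|Q\big) + \log\frac{1}{\delta}.
\]

The first part of the proof is similar to the meta-converse of PPV [4, Thm. 31]; however, we give a conceptually simple alternative proof along the lines of Wang-Colbeck-Renner [11, Lem. 3] and Wang-Renner [12, Thm. 1].

Proof. For any code $\mathcal{C} = \{\mathcal{M}, e, d\}$ with $p_{\mathrm{err}}(\mathcal{C}, W) \le \varepsilon$ and any $Q \in \mathcal{P}(\mathcal{Y})$, the following holds. Starting from a uniform distribution over $\mathcal{M}$, the Markov chain $M \to X \to Y \to M'$ induces a joint probability distribution $P_{MXYM'}$. Due to the data-processing inequality for $D_h^\varepsilon$, we immediately find $D_h^\varepsilon(P \times W\|P \times Q) = D_h^\varepsilon(P_{XY}\|P_X \times Q_Y) \ge D_h^\varepsilon(P_{MM'}\|P_M \times Q_{M'})$, where $P_X = P$ and $Q_{M'}$ is the distribution induced by $d$ applied to $Q_Y = Q$.$^4$ Moreover, using the test $\xi(m,m') = \delta_{m,m'}$ (the Kronecker delta), we can readily see that

\[
\mathbb{E}_{P_{MM'}}[\xi(M,M')] = \Pr[M = M'] \ge 1 - \varepsilon \quad\text{and}\quad \mathbb{E}_{P_M \times Q_{M'}}[\xi(M,M')] = \frac{1}{|\mathcal{C}|}.
\]

Hence, $D_h^\varepsilon(P_{MM'}\|P_M \times Q_{M'}) \ge \log|\mathcal{C}| + \log(1-\varepsilon)$ by definition of the $\varepsilon$-hypothesis testing divergence. Finally, applying Lemmas 2 and 4, we find

\[
\begin{aligned}
\sup_{x \in \mathcal{X}} D_s^{\varepsilon+\delta}\big(W(\cdot|x)\|Q\big) &\ge D_s^{\varepsilon+\delta}(P \times W\|P \times Q) \\
&\ge D_h^\varepsilon(P \times W\|P \times Q) - \log\frac{1-\varepsilon}{\delta} \ge \log|\mathcal{C}| - \log\frac{1}{\delta}.
\end{aligned}
\]

This yields the converse bound upon minimizing over $Q \in \mathcal{P}(\mathcal{Y})$. □

4 Note that due to the Markov property, the encoding can be inverted probabilistically, without affecting the correlation between $M$ and $M'$.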
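To see what Proposition 6 gives in a concrete case, the sketch below evaluates its right-hand side for $n$ uses of a BSC with the output distribution fixed to the uniform distribution on $\{0,1\}^n$ and $\delta = n^{-1/2}$. These choices, the parameter values and the function names are assumptions of this illustration; any fixed $Q$ yields a valid upper bound because the proposition takes an infimum over $Q$. By symmetry the information spectrum is the same for every input $\mathbf{x}$, so the supremum over inputs is trivial here. Natural logarithms are used.

```python
import math

def bsc_converse_bound(n, p, eps):
    """Upper bound on log M*(W^n, eps) in nats from Proposition 6 for a BSC(p), p < 1/2,
    with Q uniform on {0,1}^n and delta = n**-0.5 (both assumed choices)."""
    delta = n ** -0.5
    assert 0 < p < 0.5 and eps + delta < 1

    def llr(k):   # log W^n(y|x)/Q(y) when y differs from x in k positions
        return n * math.log(2) + k * math.log(p) + (n - k) * math.log(1 - p)

    def mass(k):  # probability of exactly k flips, computed in the log domain
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        return math.exp(log_binom + k * math.log(p) + (n - k) * math.log(1 - p))

    cum = 0.0
    for k in range(n, -1, -1):   # llr(k) increases as k decreases when p < 1/2
        cum += mass(k)
        if cum > eps + delta:
            return llr(k) + math.log(1 / delta)
    return math.inf

def bsc_capacity_and_dispersion(p):
    """Capacity and dispersion of a BSC(p) in nats."""
    C = math.log(2) + p * math.log(p) + (1 - p) * math.log(1 - p)
    V = p * (1 - p) * math.log((1 - p) / p) ** 2
    return C, V

n, p, eps = 500, 0.11, 0.1
bound = bsc_converse_bound(n, p, eps)
C, V = bsc_capacity_and_dispersion(p)
print(bound, n * C, math.sqrt(n * V))
```

For these parameters the bound falls below $nC$ by an amount of the order of $\sqrt{nV}$, as the normal approximation suggests.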
FIG. 2. Illustration of the choice of $Q_{\mathbf{k}}$ for $\mathcal{Y} = \{0, 1\}$. Note that $\zeta = 2$ for $|\mathcal{Y}| = 2$.



C. A Suitable Choice of Output Distribution Q 

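For $n$-fold repetitions of a DMC, the bound in Proposition 6 evaluates to

\[
\log M^*(W^n,\varepsilon) \le \min_{Q^{(n)} \in \mathcal{P}(\mathcal{Y}^{\times n})} \max_{\mathbf{x} \in \mathcal{X}^{\times n}} D_s^{\varepsilon+\delta}\big(W^n(\cdot|\mathbf{x})\|Q^{(n)}\big) + \log\frac{1}{\delta},
\]

and it is thus important to find a suitable choice of $Q^{(n)} \in \mathcal{P}(\mathcal{Y}^{\times n})$ to further upper bound the above. Symmetry considerations allow us to restrict the search to distributions that are invariant under permutations of the $n$ channel uses. Let $\zeta := |\mathcal{Y}|(|\mathcal{Y}| - 1)$ and let $\gamma > 0$ be a constant which is to be chosen later. Consider the following convex combination of product distributions:

\[
Q^{(n)}(\mathbf{y}) := \frac{1}{2}\sum_{\mathbf{k} \in \mathcal{K}} \frac{\exp(-\gamma\|\mathbf{k}\|_2^2)}{F} \prod_{i=1}^{n} Q_{\mathbf{k}}(y_i) + \frac{1}{2}\sum_{P_{\mathbf{x}} \in \mathcal{P}_n(\mathcal{X})} \frac{1}{|\mathcal{P}_n(\mathcal{X})|} \prod_{i=1}^{n} P_{\mathbf{x}}W(y_i), \tag{6}
\]

where $F$ is a normalization constant that ensures $\sum_{\mathbf{y}} Q^{(n)}(\mathbf{y}) = 1$ and

\[
Q_{\mathbf{k}}(y) := Q^*(y) + \frac{k_y}{\sqrt{n\zeta}}, \qquad \mathcal{K} := \left\{\mathbf{k} \in \mathbb{Z}^{|\mathcal{Y}|} \,\middle|\, \sum_{y} k_y = 0 \wedge k_y \ge -Q^*(y)\sqrt{n\zeta}\right\}.
\]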

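The convex combination of $(P_{\mathbf{x}}W)^{\times n}$ in $Q^{(n)}$ is inspired partly by Hayashi [6, Thm. 2]. What we have done in our choice of $Q_{\mathbf{k}}$ is to uniformly quantize the simplex $\mathcal{P}(\mathcal{Y})$ along axis-parallel directions. The constraint that each $\mathbf{k}$ belongs to $\mathcal{K}$ ensures that each $Q_{\mathbf{k}}$ is a valid probability mass function. See Fig. 2. We find that

\[
F \le \sum_{\mathbf{k} \in \mathbb{Z}^{|\mathcal{Y}|}} \exp(-\gamma\|\mathbf{k}\|_2^2) = \left(\sum_{k=-\infty}^{\infty} \exp(-\gamma k^2)\right)^{|\mathcal{Y}|} \le \left(1 + \sqrt{\frac{\pi}{\gamma}}\right)^{|\mathcal{Y}|}
\]

is a finite constant. Furthermore, by construction, the representation points $\{Q_{\mathbf{k}}\}_{\mathbf{k} \in \mathcal{K}}$ form an $\epsilon$-net with $\epsilon = n^{-1/2}$ for $\mathcal{P}(\mathcal{Y})$. Namely, for every $Q \in \mathcal{P}(\mathcal{Y})$, there exists a $\mathbf{k}$ such that $\|Q - Q_{\mathbf{k}}\|_2 \le n^{-1/2}$. This can be verified easily by choosing a $\mathbf{k}$ that minimizes the distance in all but one direction (say the last), so that each of the first $|\mathcal{Y}|-1$ coordinates differs by at most $\frac{1}{\sqrt{n\zeta}}$, yielding

\[
\begin{aligned}
\|Q - Q_{\mathbf{k}}\|_2^2 &= \sum_{y=1}^{|\mathcal{Y}|-1} \big(Q(y) - Q_{\mathbf{k}}(y)\big)^2 + \big(Q(|\mathcal{Y}|) - Q_{\mathbf{k}}(|\mathcal{Y}|)\big)^2 \\
&= \sum_{y=1}^{|\mathcal{Y}|-1} \big(Q(y) - Q_{\mathbf{k}}(y)\big)^2 + \left(\sum_{y=1}^{|\mathcal{Y}|-1} \big(Q(y) - Q_{\mathbf{k}}(y)\big)\right)^2 \\
&\le \frac{|\mathcal{Y}|-1}{n\zeta} + \frac{(|\mathcal{Y}|-1)^2}{n\zeta} = \frac{1}{n}.
\end{aligned}
\]

FIG. 3. Illustration of the sets in Section III D for $|\mathcal{X}| = |\mathcal{Y}| = 3$. Here, $\Pi$ is not a singleton and $\Pi_\mu W$ has measure zero in $\mathcal{P}(\mathcal{Y})$ so $W$ is rank-deficient. The unique CAOD $Q^*$ is the image of $\Pi$ under $W$, $\Pi_\mu W$ is the image of $\Pi_\mu$ under $W$ and $\Gamma_\mu^\eta$ is the "$\eta$-blown-up" version of $\Pi_\mu W$.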
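The covering property of the net can be checked numerically. The sketch below uses one concrete selection rule, which is an assumption of this illustration rather than the paper's prescription: round up on every coordinate except the one where $Q$ is largest, and let that coordinate absorb the remainder so that $\sum_y k_y = 0$. Rounding up keeps $Q_{\mathbf{k}}(y) \ge Q(y) \ge 0$ on the rounded coordinates, and the remaining coordinate stays non-negative once $n \ge \zeta$. The example $Q^*$ is hypothetical.

```python
import math

def net_point(Q, Q_star, n):
    """Return (k, Q_k) with sum(k) == 0 and ||Q - Q_k||_2 <= n**-0.5."""
    m = len(Q)
    zeta = m * (m - 1)
    step = 1 / math.sqrt(n * zeta)               # grid spacing along each axis
    last = max(range(m), key=lambda y: Q[y])     # coordinate absorbing the remainder
    k = [0] * m
    for y in range(m):
        if y != last:
            k[y] = math.ceil((Q[y] - Q_star[y]) / step)
    k[last] = -sum(k)
    Q_k = [Q_star[y] + k[y] * step for y in range(m)]
    return k, Q_k

def l2(P, Q):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(P, Q)))

Q_star = [0.2, 0.3, 0.5]   # hypothetical CAOD for a ternary output alphabet
n = 100

# Sweep a grid of target distributions on the simplex and record the worst distance.
worst = 0.0
valid = True
for i in range(21):
    for j in range(21 - i):
        Q = [i / 20, j / 20, (20 - i - j) / 20]
        k, Q_k = net_point(Q, Q_star, n)
        worst = max(worst, l2(Q, Q_k))
        valid = valid and sum(k) == 0 and min(Q_k) >= -1e-9
print(worst, n ** -0.5, valid)
```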



D. Continuity around the CAIDs and the unique CAOD 



We will often be concerned with probability distributions close to the set of CAIDs $\Pi$ in Euclidean distance, i.e., those distributions belonging to

\[
\Pi_\mu := \left\{P \in \mathcal{P}(\mathcal{X}) \,\middle|\, \min_{P^* \in \Pi} \|P - P^*\|_2 \le \mu\right\}
\]

for some small $\mu > 0$. Sometimes we also need to restrict to probability distributions in $\Pi_\mu$ with positive conditional information variance. For a constant $\nu > 0$ we define

\[
\Pi_\mu^\nu := \{P \in \Pi_\mu \,|\, V(P,W) \ge \nu\}.
\]

The image of $\Pi_\mu$ under $W$ is denoted as $\Pi_\mu W$. We also consider a larger, "$\eta$-blown-up" version of $\Pi_\mu W$, namely

\[
\Gamma_\mu^\eta := \{Q \in \mathcal{P}(\mathcal{Y}) \,|\, \exists\, P \in \Pi_\mu \text{ s.t. } \|PW - Q\|_2 \le \eta\}.
\]

Note that $\Gamma_\mu^0 = \Pi_\mu W$ if the stochastic matrix $W$ has full rank. See Fig. 3 for an illustration. The following lemma summarizes known results about these sets.

Lemma 7. Let $W : \mathcal{X} \to \mathcal{Y}$ be a DMC and $\nu > 0$ be a constant. There exist $\mu > 0$ and $\eta > 0$ and finite constants $V^+ > 0$, $T^+ > 0$, $q_{\min} > 0$, $\alpha > 0$, and $\beta > 0$ such that the following holds. For all $P \in \Pi_\mu$ and their projections $P^* := \arg\min_{P' \in \Pi} \|P - P'\|_2$ and all $Q \in \Gamma_\mu^\eta$, we have
1. $Q(y) \ge q_{\min}$ for all $y \in \mathcal{Y}$,

2. $V(W\|Q|P) \ge \frac{V_{\min}}{2}$,

3. $I(P,W) \le C(W) - \alpha\|P - P^*\|_2^2$,

4. $D(W\|Q|P) \le I(P,W) + \frac{\|Q - PW\|_2^2}{q_{\min}}$,

5. $V(W\|Q|P) \le V^+$ and $T(W\|Q|P) \le T^+$.

Furthermore, for any $P \in \Pi_\mu^\nu$ we have

6. $V(W\|Q|P) \ge \frac{\nu}{2} > 0$,

7. $\left|\sqrt{V(P,W)} - \sqrt{V(P^*,W)}\right| \le \beta\|P - P^*\|_2$,

8. $\left|\sqrt{V(W\|Q|P)} - \sqrt{V(P,W)}\right| \le \beta\|Q - PW\|_2$.

Proof. Properties 1 and 2 hold for small enough $\mu$ and $\eta$ by continuity since $Q^*$ has full support [16, Cor. 1 to Thm. 4.5.1] and $V(W\|P^*W|P^*) \ge V_{\min}$. The case $V_{\min} = 0$ in Property 2 is trivial since $V(W\|Q|P) \ge 0$. Property 3 was established by Strassen [3, Eq. (4.41)] as well as PPV [4, Eq. (501)]. Since $D(W\|Q|P) = I(P,W) + D(PW\|Q)$, Property 4 follows immediately from the fact that $D(PW\|Q) \le \frac{1}{\min_y Q(y)}\|PW - Q\|_2^2$ (see, e.g., [17, Lem. 6.3]). Property 5 follows from the fact that $(P,Q) \mapsto V(W\|Q|P)$ and $(P,Q) \mapsto T(W\|Q|P)$ are finite and continuous on the compact set $\Pi_\mu \times \Gamma_\mu^\eta$.

Property 6 again holds for small enough $\eta$ by continuity and since $V(W\|PW|P) \ge \nu$ by definition of the set $\Pi_\mu^\nu$. To verify Properties 7 and 8, note that the quotient $W(y|x)/Q(y) < \infty$ by Property 1. If $W(y|x)/Q(y) = 0$, the corresponding terms in the sums defining $V(P,W)$ and $V(W\|Q|P)$ are excluded because $\vartheta \log^k \vartheta \to 0$ as $\vartheta \to 0$ for all $k > 0$. Hence, $P \mapsto V(P,W)$ and $Q \mapsto V(W\|Q|P)$ are continuously differentiable on $\Pi_\mu$ and $\Gamma_\mu^\eta$ respectively. Because $t \mapsto \sqrt{t}$ is continuously differentiable away from 0, by Property 6, $P \mapsto \sqrt{V(P,W)}$ and $Q \mapsto \sqrt{V(W\|Q|P)}$ are Lipschitz on $\Pi_\mu^\nu$ and $\Gamma_\mu^\eta$ respectively. The uniformity of $\beta$ in $P$ in Property 8 can be verified by explicitly calculating the derivative of $Q \mapsto \sqrt{V(W\|Q|P)}$ and noting that it can be upper bounded by a finite constant independent of $P$. □
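The inequality behind Property 4, $D(P\|Q) \le \|P - Q\|_2^2 / \min_y Q(y)$, can be checked on a grid of distributions. The sketch below assumes natural logarithms (for which the relative entropy is bounded by the $\chi^2$-divergence, and the latter by the stated quadratic term) and uses a hypothetical full-support $Q$; function names are illustrative.

```python
import math

def kl(P, Q):
    """Relative entropy D(P||Q) in nats; Q is assumed to have full support."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def quadratic_bound(P, Q):
    """||P - Q||_2^2 / min_y Q(y)."""
    return sum((p - q) ** 2 for p, q in zip(P, Q)) / min(Q)

Q = [0.2, 0.3, 0.5]   # hypothetical output distribution with full support

# Largest violation of D(P||Q) <= quadratic_bound(P, Q) over a grid on the simplex.
worst_gap = -math.inf
for i in range(41):
    for j in range(41 - i):
        P = [i / 40, j / 40, (40 - i - j) / 40]
        worst_gap = max(worst_gap, kl(P, Q) - quadratic_bound(P, Q))
print(worst_gap)
```

A non-positive `worst_gap` indicates that the inequality holds at every grid point, with equality attained at $P = Q$.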



E. Asymptotics for DMCs 

We are now ready to prove our main result. Several special cases of Theorem 1 require 
additional proof techniques. For the convenience of the reader, we state them separately as 
propositions. Theorem 1 then follows as a straightforward consequence of these propositions. 
See Fig. 1 for a summary. The following proposition considers the "regular" case, where the channel and $\varepsilon$ satisfy $V_\varepsilon > 0$.

Proposition 8. For every DMC $W$ and $\varepsilon \in (0,1)$ such that $V_\varepsilon > 0$, the blocklength $n$, $\varepsilon$-error capacity satisfies

\[
\log M^*(W^n,\varepsilon) \le nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \frac{1}{2}\log n + O(1).
\]

Remark 4. In the following proof of Proposition 8, we deal with all cases except $\varepsilon = \frac{1}{2}$, $V_{\min} = 0$ and $V_{\max} = V_\varepsilon > 0$. This special case will be handled in Proposition 10(i) as it uses the proof techniques in Proposition 9.
Proof. Firstly, we employ Proposition 6 to provide a bound on $\log M^*(W^n,\varepsilon)$. We choose $\delta = n^{-1/2}$, which satisfies $0 < \delta < 1 - \varepsilon$ for sufficiently large $n$. Substitute the output distribution $Q^{(n)}$ in (6) to get

\[
\log M^*(W^n,\varepsilon) \le \max_{\mathbf{x} \in \mathcal{X}^{\times n}} \underbrace{D_s^{\varepsilon+\delta}\big(W^n(\cdot|\mathbf{x})\|Q^{(n)}\big)}_{=:\ \mathrm{cv}(\mathbf{x})} + \frac{1}{2}\log n.
\]

It remains to show that each term $\mathrm{cv}(\mathbf{x})$ in the maximization is upper bounded by $nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + G$ for a suitable constant $G$ for all sufficiently large $n$.

We apply Lemma 7, which supplies us with finite, positive constants $\mu$, $\eta$, $V^+$, $T^+$, $q_{\min}$, $\alpha$ and $\beta$. If $V_{\min} > 0$, we choose $\nu = \frac{V_{\min}}{2}$ such that $\Pi_\mu^\nu = \Pi_\mu$, otherwise $\nu > 0$ will be specified later. See Case c) below.

We distinguish between three cases for the following; either a) $\mathbf{x}$ satisfies $P_{\mathbf{x}} \notin \Pi_\mu$ or b) $\mathbf{x}$ satisfies $P_{\mathbf{x}} \in \Pi_\mu^\nu$ or c) $\mathbf{x}$ satisfies $P_{\mathbf{x}} \in \Pi_\mu \setminus \Pi_\mu^\nu$. Note that Case c) is only relevant if $V_{\min} = 0$, as otherwise $\Pi_\mu^\nu = \Pi_\mu$ by definition of $\nu$. This strategy in which we partition input types into such classes was proposed by Strassen [3, Sec. 4]. See also PPV [4, App. I]. Intuitively, for Case a), $P_{\mathbf{x}}$ is far from the CAIDs so the first-order term is smaller than capacity; for Case b), $P_{\mathbf{x}}$ has high conditional information variance and thus bounded skewness so we can apply the Berry-Esseen-type bound of Lemma 5; and for Case c), $P_{\mathbf{x}}$ has small conditional information variance so we must use the Chebyshev-type bound and choose $\nu$ based on $V_{\max}$ instead of $V_{\min}$.



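Case a): $P_{\mathbf{x}} \notin \Pi_\mu$

The mutual information outside $\Pi_\mu$ is bounded away from the capacity, i.e., $I(P_{\mathbf{x}},W) \le C' < C$ for all $P_{\mathbf{x}} \notin \Pi_\mu$.

We first apply Lemma 3 and then Lemma 5 to bound

\[
\begin{aligned}
\mathrm{cv}(\mathbf{x}) &\le D_s^{\varepsilon+\delta}\big(W^n(\cdot|\mathbf{x})\|(P_{\mathbf{x}}W)^{\times n}\big) + \log\big(2|\mathcal{P}_n(\mathcal{X})|\big) \\
&\le nI(P_{\mathbf{x}},W) + \sqrt{\frac{nV(P_{\mathbf{x}},W)}{1-\varepsilon-\delta}} + \log\big(2|\mathcal{P}_n(\mathcal{X})|\big).
\end{aligned}
\]

For the second inequality, we note that $D_n$ in Lemma 5 evaluates to

\[
D_n = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{W(\cdot|x_i)}\left[\log\frac{W(\cdot|x_i)}{P_{\mathbf{x}}W}\right] = \mathbb{E}_{P_{\mathbf{x}} \times W}\left[\log\frac{W}{P_{\mathbf{x}}W}\right] = D(W\|P_{\mathbf{x}}W|P_{\mathbf{x}}) = I(P_{\mathbf{x}},W),
\]

and a similar calculation can be done to show that $V_n = V(P_{\mathbf{x}},W)$. Invoking [4, Lem. 62] and [15, Rmk. 3.1.1] yields the uniform bound $V(P_{\mathbf{x}},W) \le \frac{8\log^2 e}{e^2}|\mathcal{Y}| \le 2.3\,|\mathcal{Y}|$. Hence,

\[
\mathrm{cv}(\mathbf{x}) \le nC' + \sqrt{n}\,\sqrt{\frac{2.3\,|\mathcal{Y}|}{1-\varepsilon-\delta}} + (|\mathcal{X}| - 1)\log(n+1) + \log 2.
\]

Since $C' < C$, the linear term asymptotically dominates the term growing with the square root of $n$ and the term growing logarithmically in $n$. Hence, it is evident that $\mathrm{cv}(\mathbf{x}) \le nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon)$ for sufficiently large $n$.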



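Case b): $P_{\mathbf{x}} \in \Pi_\mu^\nu$

For each $\mathbf{x}$, we denote by $Q_{\mathbf{k}(\mathbf{x})}$ the element of the $\epsilon$-net (constructed in Section III C) closest to $P_{\mathbf{x}}W$. We note that since $\|Q_{\mathbf{k}(\mathbf{x})} - P_{\mathbf{x}}W\|_2 \le \epsilon = n^{-1/2}$, we have $Q_{\mathbf{k}(\mathbf{x})} \in \Gamma_\mu^\eta$ for sufficiently large $n$, which enables us to apply the properties described in Lemma 7 extensively below.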
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We first use Lemma 3 to bound 

cv(x) < D s e + a (^(.|x)||(Q k(x) )X") + 7 ||k(x)|||+log (2F). 

We now employ Lemma 5, where we choose Pj = W(-|xi) resulting in D n := -D(W||Q k ( x )|P x ), 
V n := V{W\\Q^)\P X ) and T n := T(iy||Q k(x) |P x ). From Lemma 7, we have that T n < T+ 

and < | < V n < V + . We then introduce the finite constant B := 1 + 6y/8T+/v2 , while 

substituting for 5 = n~ 3 , to get 

cv(x) < nL>(W||Q k(x) |P x ) + ^(WHQk^lP,)^ 1 fe + + 7l|k(x)||l + log (2P). 

We now require that n > N, where iV is chosen large enough such that e+ < 1. This ensures 
that the coefficient of the term growing as y/n in the above expression is finite. Next, we use 
the fact that is infinitely differentiable and V(W||Q k (x) I -fx) < ^+ is finite to bound 

/^(im^r 1 f e + JLj < ^(^HQk^ip,)^ 1 ^) + Gl 

for some finite constant Gi and all n > N. Thus, defining G2 := G\ + log(2P), we get 



cv(x) < nD(W\\Q Hx) \P x ) + ylnV{W\\Q K x)\Px)*-\e) + G 2 , 

Next, we would like to replace Q k ( x ) with P X W in the above bound. This can be done without 
too much loss due to Lemma 7, which states that 

\\P X W - Qu^Wl 1 

d{w\\Qh*) l p x) < /(P x , w) + ^ < /(p x , + 

9111111 W Qmin 

and 



^(W||Q k(x) |P x ) - v/n^W) < /3||P X W - Q k(x) || 2 < -£= 



Hence, choosing G3 := — h /3|<3? 1 (e)| + G2, we find that 

ymin I ' 



cv(x) < nI(P x ,W) + v^PxTl 7 ) + 7||k(x)||l + G 3 . 

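In the following, we use the fact that all distributions (and types) $P_{\mathbf{x}}$ in $\Pi_\mu^\nu$ satisfy $I(P_{\mathbf{x}}, W) \le C - \alpha\xi^2$ and $\big|\sqrt{V(P_{\mathbf{x}}, W)} - \sqrt{V(P^*, W)}\big| \le \beta\xi$, where $P^* := \arg\min_{P' \in \Pi} \|P_{\mathbf{x}} - P'\|_2$ (which is unique) and $\xi := \|P_{\mathbf{x}} - P^*\|_2$. Hence,

$$\mathrm{cv}(\mathbf{x}) \le nC + \sqrt{nV(P^*, W)}\,\Phi^{-1}(\varepsilon) + \Big(-\alpha\xi^2 n + \beta|\Phi^{-1}(\varepsilon)|\,\xi\sqrt{n} + \gamma\|k(\mathbf{x})\|_2^2\Big) + G_3. \tag{7}$$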



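It thus remains to show that the term in the bracket is upper bounded by a constant, for an appropriate choice of $\gamma$. Let $\|W\|_2 := \max\{\|uW\|_2 : \|u\|_2 \le 1\}$ be the spectral norm of the matrix $W$. It is easy to see that $\|W\|_2 \le \sqrt{|\mathcal{X}|}$. From the construction of the $\epsilon$-net in Section III C,

$$\begin{aligned}
\|k(\mathbf{x})\|_2 &= \sqrt{n\zeta}\,\|Q_{k(\mathbf{x})} - Q^*\|_2 \\
&\le \sqrt{n\zeta}\,\big(\|Q_{k(\mathbf{x})} - P_{\mathbf{x}}W\|_2 + \|P_{\mathbf{x}}W - Q^*\|_2\big) \\
&\le \sqrt{\zeta}\,\big(1 + \|W\|_2\,\xi\sqrt{n}\big),
\end{aligned}$$

where the last step uses $\|Q_{k(\mathbf{x})} - P_{\mathbf{x}}W\|_2 \le n^{-1/2}$ and $\|P_{\mathbf{x}}W - Q^*\|_2 = \|(P_{\mathbf{x}} - P^*)W\|_2 \le \|W\|_2\,\xi$.

Substituting this bound into (7), we find that the term in the bracket is at most

$$\big(\gamma\zeta\|W\|_2^2 - \alpha\big)\,\xi^2 n + \big(\beta|\Phi^{-1}(\varepsilon)| + 2\gamma\zeta\|W\|_2\big)\,\xi\sqrt{n} + \gamma\zeta.$$

The expression is a quadratic polynomial in $\xi\sqrt{n}$ and has a finite maximum if we choose $\gamma$ such that $\gamma\zeta\|W\|_2^2 < \alpha$. Hence, we can write

$$\mathrm{cv}(\mathbf{x}) \le nC + \sqrt{nV(P^*, W)}\,\Phi^{-1}(\varepsilon) + G_4$$

for an appropriate constant $G_4$ and $n > N$.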
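The boundedness argument for the bracket can be illustrated numerically. The sketch below treats the bracket as a quadratic $a t^2 + b t + c$ in $t = \xi\sqrt{n}$ with a negative leading coefficient and compares its closed-form maximum with a grid search. All numeric constants are hypothetical and chosen only so that the leading coefficient is negative; none are taken from the paper.

```python
def bracket(t, a, b, c):
    """Value of the quadratic a*t^2 + b*t + c at t = xi * sqrt(n)."""
    return a * t * t + b * t + c

def bracket_max(a, b, c):
    """Closed-form maximum of a*t^2 + b*t + c over real t, valid only for a < 0."""
    if a >= 0:
        raise ValueError("need a < 0, i.e. gamma * zeta * ||W||_2^2 < alpha")
    return c - b * b / (4.0 * a)

# Hypothetical constants (assumptions for illustration only).
alpha, beta, gamma, zeta = 1.0, 0.5, 0.1, 1.0
w_norm = 1.4            # stands in for the spectral norm ||W||_2
phi_inv_eps = 1.2816    # roughly Phi^{-1}(0.9)

a = gamma * zeta * w_norm ** 2 - alpha
b = beta * abs(phi_inv_eps) + 2 * gamma * zeta * w_norm
c = gamma * zeta

peak = bracket_max(a, b, c)
# Grid search over t in [0, 1000) with step 0.01; it never exceeds the closed form.
grid_max = max(bracket(0.01 * i, a, b, c) for i in range(100_000))
```

Because the maximum does not depend on $n$, it can be absorbed into the constant $G_4$. If $\gamma$ were chosen too large the leading coefficient would turn non-negative and `bracket_max` would raise instead of returning a bound.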

Case c): $P_{\mathbf{x}} \in \Pi_\mu \setminus \Pi_\mu^\nu$

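Note that this case only appears if $V_{\min} = 0$, $V_{\max} = V_\varepsilon > 0$ and $\varepsilon \ge \frac12$. We consider the case $\varepsilon > \frac12$ (cf. Remark 4), leaving the $\varepsilon = \frac12$ case for Proposition 10(i). We have

$$\begin{aligned}
\mathrm{cv}(\mathbf{x}) &\le D_s^{\varepsilon+\delta}\big(W^n(\cdot|\mathbf{x}) \,\big\|\, (P_{\mathbf{x}}W)^{\times n}\big) + \log\big(2|\mathcal{P}_n(\mathcal{X})|\big) \\
&\le nI(P_{\mathbf{x}}, W) + \sqrt{\frac{n\nu}{1-\varepsilon-\delta}} + \log\big(2|\mathcal{P}_n(\mathcal{X})|\big).
\end{aligned}$$

Now we choose $\nu > 0$ to be any constant satisfying

$$\sqrt{\frac{\nu}{1-\varepsilon-\delta}} + \frac{\log\big(2|\mathcal{P}_n(\mathcal{X})|\big)}{\sqrt{n}} \le \sqrt{V_{\max}}\,\Phi^{-1}(\varepsilon).$$

It is certainly possible to find such a $\nu$ since the number of types is polynomial, so $\delta$ and the second term on the left are arbitrarily small for large enough $n$. Furthermore, $\sqrt{V_{\max}}\,\Phi^{-1}(\varepsilon) > 0$. This is where $\varepsilon \ne \frac12$ is crucial. Uniting the preceding two bounds yields

$$\mathrm{cv}(\mathbf{x}) \le nI(P_{\mathbf{x}}, W) + \sqrt{nV_{\max}}\,\Phi^{-1}(\varepsilon) \le nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon).$$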



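Summarizing the bounds for Cases a), b) and c), we thus have the following asymptotic expansion for all $n$ sufficiently large:

$$\begin{aligned}
\log M^*(W^n, \varepsilon) &\le \max_{P^* \in \Pi}\; nC + \sqrt{nV(P^*, W)}\,\Phi^{-1}(\varepsilon) + \frac12\log n + G_4 \\
&= nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \frac12\log n + G_4,
\end{aligned}$$

where the last equality follows by definition of $V_\varepsilon$. □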
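To get a feel for the relative size of the three leading terms, the following sketch evaluates $nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \frac12\log n$ for a binary symmetric channel. The choice of channel, crossover probability, blocklength and target error are illustrative assumptions and do not come from this section. The constant $G_4$ is not known here and is simply dropped, and logarithms are taken to base 2.

```python
import math
from statistics import NormalDist

def bsc_capacity(p):
    """Capacity of a BSC with crossover probability p, in bits per channel use."""
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return 1.0 - h

def bsc_dispersion(p):
    """Dispersion of a BSC in bits^2; the CAID is unique, so V_min = V_max."""
    return p * (1 - p) * math.log2((1 - p) / p) ** 2

def upper_bound_bits(n, eps, p):
    """n*C + sqrt(n*V)*Phi^{-1}(eps) + 0.5*log2(n), with the O(1) term omitted."""
    c, v = bsc_capacity(p), bsc_dispersion(p)
    return n * c + math.sqrt(n * v) * NormalDist().inv_cdf(eps) + 0.5 * math.log2(n)

# Hypothetical operating point: n = 1000 channel uses, error 1e-3, crossover 0.11.
bound = upper_bound_bits(n=1000, eps=1e-3, p=0.11)
```

At this operating point the $\sqrt{n}$ term is negative and much larger in magnitude than $\frac12\log n$, so the bound lies well below $nC$; the logarithmic term only becomes visible as a small correction.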

Surprisingly, the first order approximation is accurate up to a constant term if $V_\varepsilon = 0$, unless the channel is exotic and $\varepsilon \ge \frac12$.

Proposition 9. For every DMC $W$ and $\varepsilon \in (0, 1)$ such that $V_\varepsilon = 0$, the blocklength $n$, $\varepsilon$-error capacity satisfies $\log M^*(W^n, \varepsilon) \le nC + O(1)$, unless the channel is exotic and $\varepsilon \ge \frac12$.

Proof. Again, from our bound on the converse for general channels (Prop. 6), we have

$$\log M^*(W^n, \varepsilon) \le \max_{\mathbf{x} \in \mathcal{X}^n} \underbrace{D_s^{\varepsilon+\delta}\big(W^n(\cdot|\mathbf{x}) \,\big\|\, Q^{(n)}\big)}_{=:\ \mathrm{cv}(\mathbf{x})} + \log\frac{1}{\delta}. \tag{8}$$

For this analysis, we ignore Section III C and instead choose $Q^{(n)} = (Q^*)^{\times n}$ to be the $n$-fold extension of the CAOD. We choose $\delta = \frac12 - \varepsilon$ if $\varepsilon < \frac12$ and $\delta = \frac{1-\varepsilon}{2}$ otherwise; hence, the term $\log\frac{1}{\delta}$ is finite and independent of $n$. Since $F$ is bounded and $k = 0$ for $(Q^*)^{\times n}$, it remains to prove that $\mathrm{cv}(\mathbf{x}) \le nC + O(1)$ for all $\mathbf{x}$.

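For this purpose, let $m(\mathbf{x})$ be the number of non-zero variance letters in $\mathbf{x}$, i.e., $m(\mathbf{x}) := nP_{\mathbf{x}}(\mathcal{X}_+) = \sum_{i=1}^n 1\{x_i \in \mathcal{X}_+\}$, where $\mathcal{X}_+ := \{x \in \mathcal{X} : V(W(\cdot|x)\|Q^*) > 0\}$. There exist finite constants $v_{\min}$, $v_{\max}$ and $t_{\max}$ such that, for every $x \in \mathcal{X}_+$,

$$0 < v_{\min} \le V(W(\cdot|x)\|Q^*) \le v_{\max}, \quad\text{and}\quad T(W(\cdot|x)\|Q^*) \le t_{\max}.$$

By the definition of $D_n := D(W\|Q^*|P_{\mathbf{x}})$, $V_n := V(W\|Q^*|P_{\mathbf{x}})$ and $T_n := T(W\|Q^*|P_{\mathbf{x}})$ (cf. Lemma 5), we have

$$\frac{m(\mathbf{x})}{n}\,v_{\min} \le V_n \le \frac{m(\mathbf{x})}{n}\,v_{\max}, \quad\text{and}\quad T_n \le \frac{m(\mathbf{x})}{n}\,t_{\max}. \tag{9}$$

Further defining $B_n := 6T_n/V_n^{3/2}$, we thus find

$$\frac{B_n}{\sqrt{n}} \le \frac{L}{\sqrt{m(\mathbf{x})}}, \quad\text{where}\quad L := \frac{6\,t_{\max}}{v_{\min}^{3/2}} < \infty.$$

Let $m^*$ be an integer satisfying $L/\sqrt{m^*} \le r'$, where $r'$ is chosen such that $\Phi^{-1}(\frac12 + r) \le 3r$
for all $r \in [0, r']$. The choice $r' = 0.35$ does the job.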
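The claim about $r' = 0.35$ is easy to confirm numerically. The sketch below uses the inverse normal CDF from Python's standard library and scans a grid of values of $r$; the grid resolution and the helper name are arbitrary choices made for illustration.

```python
from statistics import NormalDist

def linear_bound_holds(r_prime, slope=3.0, steps=10_000):
    """Check Phi^{-1}(1/2 + r) <= slope * r on a uniform grid of r in (0, r_prime]."""
    std_normal = NormalDist()
    for i in range(1, steps + 1):
        r = r_prime * i / steps
        if std_normal.inv_cdf(0.5 + r) > slope * r:
            return False
    return True

ok_at_035 = linear_bound_holds(0.35)   # the choice made in the text
ok_at_040 = linear_bound_holds(0.40)   # a slightly larger r' for comparison
```

Since $\Phi^{-1}(\frac12 + r)$ is convex in $r$ on $[0, \frac12)$ and vanishes at $r = 0$, checking the endpoint $r = r'$ would already suffice; the grid scan is only a cross-check. The comparison value $0.40$ fails because $\Phi^{-1}(0.9) \approx 1.28 > 1.2$.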

For $\varepsilon < \frac12$, following Strassen's argument [3, Eq. (4.53)-(4.54)] (see also PPV [4, App. I]), we distinguish between two classes of sequences as follows: the sequence $\mathbf{x}$ satisfies either a) $m(\mathbf{x}) \ge m^*$, or b) $m(\mathbf{x}) < m^*$. Finally, c) considers the case where $W$ is not exotic and $\varepsilon \ge \frac12$. Intuitively, for Case a), we can use the Berry-Esseen-type bound because $m(\mathbf{x})$ is large, and hence $B_n$ can be bounded appropriately; for Case b), we use the Chebyshev-type bound because $m(\mathbf{x})$ is small; and for Case c), we use the non-exoticness of $W$ to bound $D_n$ away from $C$.

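Case a): $\varepsilon < \frac12$ and $m(\mathbf{x}) \ge m^*$
We apply the Berry-Esseen-type bound in Lemma 5 to (8) to find

$$\begin{aligned}
\mathrm{cv}(\mathbf{x}) &\le nD_n + \sqrt{nV_n}\,\Phi^{-1}\Big(\varepsilon + \delta + \frac{B_n}{\sqrt{n}}\Big) \\
&\le nD_n + \sqrt{nV_n}\,\Phi^{-1}\Big(\frac12 + \frac{L}{\sqrt{m(\mathbf{x})}}\Big) \le nD_n + 3L\sqrt{\frac{nV_n}{m(\mathbf{x})}}.
\end{aligned} \tag{10}$$

Here, we used the fact that $\varepsilon + \delta = \frac12$ by definition of $\delta$, and the proof concludes with the
observation that $\frac{nV_n}{m(\mathbf{x})} \le v_{\max}$ is bounded by a constant, and $D_n \le C$ for all $\mathbf{x}$.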
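The chain of inequalities in Case a) can be traced on synthetic data. The sketch below draws per-letter variances and third moments at random for the $m(\mathbf{x})$ letters in $\mathcal{X}_+$ (the remaining letters contribute zero to both), forms $V_n$, $T_n$ and $B_n$ as in (9), and compares the Berry-Esseen-type correction with the final bound $3L\sqrt{nV_n/m(\mathbf{x})}$. The constants `V_MIN`, `V_MAX`, `T_MAX`, the blocklength and the random model are all hypothetical.

```python
import math
import random
from statistics import NormalDist

# Hypothetical per-letter constants (assumptions for illustration only).
V_MIN, V_MAX, T_MAX = 0.5, 1.0, 1.0
L = 6 * T_MAX / V_MIN ** 1.5

def case_a_check(n, m, seed=0):
    """Return (B_n/sqrt(n), L/sqrt(m), Berry-Esseen term, final bound) for a synthetic sequence."""
    rng = random.Random(seed)
    v = [rng.uniform(V_MIN, V_MAX) for _ in range(m)]   # variances of letters in X_+
    t = [rng.uniform(0.0, T_MAX) for _ in range(m)]     # third absolute moments
    V_n = sum(v) / n        # zero-variance letters add nothing
    T_n = sum(t) / n
    B_n = 6 * T_n / V_n ** 1.5
    berry_esseen = math.sqrt(n * V_n) * NormalDist().inv_cdf(0.5 + B_n / math.sqrt(n))
    final = 3 * L * math.sqrt(n * V_n / m)
    return B_n / math.sqrt(n), L / math.sqrt(m), berry_esseen, final

b_term, l_term, be_term, final_term = case_a_check(n=10_000, m=4_000)
```

With these values $L/\sqrt{m(\mathbf{x})} \approx 0.27 \le 0.35$, so the linearisation $\Phi^{-1}(\frac12 + r) \le 3r$ applies, and the final bound is at most $3L\sqrt{v_{\max}}$ regardless of $n$.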

Case b): $\varepsilon < \frac12$ and $m(\mathbf{x}) < m^*$
We apply the Chebyshev-type bound in Lemma 5 to (8), yielding

$$\mathrm{cv}(\mathbf{x}) \le nD_n + \sqrt{\frac{nV_n}{1-\varepsilon-\delta}} = nD_n + \sqrt{2nV_n}. \tag{11}$$

Since by (9), $nV_n \le m^* v_{\max}$ and $D_n \le C$ for all $\mathbf{x}$, we find the desired bound.

Case c): not exotic, $\varepsilon \ge \frac12$
Lemma 5 applied to (8) again yields



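$$\mathrm{cv}(\mathbf{x}) \le nD_n + \sqrt{\frac{nV_n}{1-\varepsilon-\delta}} = nD_n + \sqrt{\frac{2nV_n}{1-\varepsilon}}$$

because in this case, $\delta = \frac{1-\varepsilon}{2}$. By virtue of the fact that $V_{\max} = 0$ and $W$ is not exotic, we have that either

$$D(W(\cdot|x)\|Q^*) < C \quad\text{or}\quad V(W(\cdot|x)\|Q^*) = 0 \tag{12}$$

for all symbols $x \in \mathcal{X}$. If $\mathcal{X}_+$ is empty, we have $V_n = 0$ and the bound is immediate. Otherwise, we define $\psi := C - \max_{x \in \mathcal{X}_+} D(W(\cdot|x)\|Q^*) > 0$, which is positive due to the condition in (12). Using this, we find that $nD_n \le nC - m(\mathbf{x})\psi$ and $nV_n \le v_{\max}\,m(\mathbf{x})$ by (9). Thus,

$$\mathrm{cv}(\mathbf{x}) \le nC - m(\mathbf{x})\psi + \sqrt{\frac{2\,m(\mathbf{x})\,v_{\max}}{1-\varepsilon}}.$$

The latter two terms constitute a quadratic polynomial in $\sqrt{m(\mathbf{x})}$, and hence, their sum has a finite maximum. □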
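The last step can be made concrete: writing $s = \sqrt{m(\mathbf{x})}$, the excess over $nC$ is $-\psi s^2 + \kappa s$ with $\kappa = \sqrt{2v_{\max}/(1-\varepsilon)}$, whose maximum over $s \ge 0$ is $\kappa^2/(4\psi) = v_{\max}/(2\psi(1-\varepsilon))$. The sketch below scans all integer values of $m(\mathbf{x})$ up to a blocklength and compares with this closed form. The values of $\psi$, $v_{\max}$, $\varepsilon$ and $n$ are hypothetical.

```python
import math

def excess(m, psi, v_max, eps):
    """Excess over n*C in Case c): -m*psi + sqrt(2*m*v_max/(1-eps))."""
    return -m * psi + math.sqrt(2.0 * m * v_max / (1.0 - eps))

def excess_cap(psi, v_max, eps):
    """Closed-form maximum of the excess over real m >= 0."""
    return v_max / (2.0 * psi * (1.0 - eps))

# Hypothetical constants (assumptions for illustration only).
psi, v_max, eps, n = 0.2, 1.5, 0.7, 10_000

cap = excess_cap(psi, v_max, eps)
worst = max(excess(m, psi, v_max, eps) for m in range(n + 1))
```

The scan never exceeds `cap`, and `cap` does not involve $n$, which is exactly why the excess contributes only an $O(1)$ term.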

Finally, we deal with the case that was left out in Proposition 8.

Proposition 10. Let $\varepsilon = \frac12$. The following hold:

(i) For every DMC $W$ such that $V_{\min} = 0$ and $V_{\max} > 0$, the blocklength $n$, $\varepsilon$-error capacity satisfies $\log M^*(W^n, \varepsilon) \le nC + \frac12\log n + O(1)$.

(ii) For every exotic DMC $W$ (in particular, $V_{\max} = 0$), the same bound as in (i) holds.

Proof. By placing no assumptions on $V_{\max} \ge 0$, we can prove both parts in tandem. The proof follows closely that of Proposition 9 with the exception that we choose $\delta = n^{-1/2}$ so the $\log\frac{1}{\delta}$ term evaluates to $\frac12\log n$. It remains to show that $\mathrm{cv}(\mathbf{x}) \le nC + O(1)$. We split the analysis into Cases a) and b) as in Proposition 9 and let $D_n := D(W\|Q^*|P_{\mathbf{x}})$ and $V_n := V(W\|Q^*|P_{\mathbf{x}})$.

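Case a): $\varepsilon = \frac12$, $V_{\min} = 0$ and $m(\mathbf{x}) \ge m^*$
By the same steps that led to (10), we have

$$\mathrm{cv}(\mathbf{x}) \le nD_n + 3(L+1)\sqrt{\frac{nV_n}{m(\mathbf{x})}}$$

because $\delta = n^{-1/2}$. We obtain the desired bound by noting that $\frac{nV_n}{m(\mathbf{x})} \le v_{\max}$ and $D_n \le C$.

Case b): $\varepsilon = \frac12$, $V_{\min} = 0$ and $m(\mathbf{x}) < m^*$
By the same steps that led to (11), we have

$$\mathrm{cv}(\mathbf{x}) \le nD_n + \sqrt{4nV_n}$$

because $1 - \varepsilon - \delta = \frac12 - \delta \ge \frac14$ for all $n \ge 16$. The proof is completed by noting that $nV_n \le m^* v_{\max}$ and $D_n \le C$. □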

Proof of Theorem 1. The first statement follows by Propositions 8 and 10(i). The second statement follows by Proposition 9. □